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and that the majority (90-99%) of these have little in common with known biologically 
extant compositions. 

The aim of the present invention is to attempt to alleviate some of the 
above-described problems and to reduce the large number of irrelevant compositions 
returned by existing tools. 

Any discussion of documents, acts, materials, devices, articles or the like which 
has been included in the present specification is solely for the purpose of providing a 
context for the present invention. It is not to be taken as an admission that any or all of 
these matters form part of the prior art base or were common general knowledge in the 
field relevant to the present invention as it existed before the priority date of each claim 
of this application. 

Summary of the Invention 

In a first broad aspect, the present invention incorporates statistical measures of 
biological relevance for the candidate compositions returned. 

Typically, biological relevance, expressed as a numerical score, or biological 
index, is determined by statistical comparison to an established reference set of known 
and fully characterised compositions, in the case of glycans a reference set such as the 
Glycosuite (http://www.glycosuite.com) database. The biological index of any given 
composition may then be used as a basis for discarding biologically "unlikely" 
compositions, as well as for ranking (sorting) of returned compositions by biological 
likeliness. 

Empirically, for glycans this allows between 90% and 99.9% of candidate 
compositions returned by any given search to be discarded, whilst preserving and 
ranking the remaining, biologically likely compositions. 

In one aspect the present invention provides a method of determining the 
likelihood of a saccharide composition of a candidate glycan comprising: 

providing a search mass of a glycan whose composition is to be determined; 

generating a list of possible glycans made up of components, including 
monosaccharides, whose total mass is within a predetermined tolerance of the search 
mass; 

selecting a reference group of known characterised glycans; 

establishing the mean and standard deviation of each component appearing in 
the reference group of the known characterised glycans; 

for each candidate glycan calculating a partial score for each component in that 
theoretical glycan candidate, the partial score being calculated from the mean and 
standard deviation of the component appearing in the reference group and which 
provides a measure of the likelihood of that component being present in the candidate 
glycan; and 

combining the partial scores to provide an indication of the likelihood of that 
candidate glycan occurring. 
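
The step of generating a list of possible glycans within a tolerance of the search mass can be illustrated with a minimal sketch. The residue masses, the added water mass, the function names and the `max_counts` parameter are assumptions made for illustration and are not taken from the specification; the brute-force enumeration is simply one way such a list might be produced.

```python
from itertools import product

# Assumed monoisotopic residue masses in Da (standard values, not from the text).
RESIDUE_MASS = {
    "Hex": 162.0528,
    "HexNAc": 203.0794,
    "dHex": 146.0579,
    "NeuAc": 291.0954,
}
WATER = 18.0106  # assumed: one water added for a free, underivatised glycan


def composition_mass(composition):
    """Mass of a composition given as {component: count}."""
    return WATER + sum(RESIDUE_MASS[name] * n for name, n in composition.items())


def candidate_compositions(search_mass, tolerance, max_counts):
    """Return every composition whose mass lies within +/- tolerance of search_mass.

    max_counts maps each component to the largest count to try (a hypothetical
    parameter standing in for a per-component maximum).
    """
    names = list(max_counts)
    ranges = [range(max_counts[name] + 1) for name in names]
    candidates = []
    for counts in product(*ranges):
        # Keep only the components that actually occur in this composition.
        composition = {n: c for n, c in zip(names, counts) if c > 0}
        if not composition:
            continue
        if abs(composition_mass(composition) - search_mass) <= tolerance:
            candidates.append(composition)
    return candidates
```

For instance, `candidate_compositions(1201.42, 0.5, {"Hex": 5, "HexNAc": 5, "dHex": 2, "NeuAc": 2})` would include the composition of three Hex, two HexNAc and one NeuAc under the assumed masses.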

More particularly, for glycans, in one aspect the present invention provides a 
method of characterising glycans comprising the steps of: 
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necessary in order to obtain a sufficiently large sample size (preferably at least 100 
known compositions). In the case of the Glycosuite database of known sugar 
structures, a mass tolerance of 200 Da was empirically determined to be sufficient to 
provide in excess of 100 known compositions for search masses up to around 3500 Da. 

By way of example, if the search mass were 1000 Da there may be 100 known 
glycans in the database whose mass is between 800 and 1200 Da. The mean and 
standard deviation of each monosaccharide/component appearing in those 
known glycans in the database is then determined. If we take HexNAc as an example 
we may find that, on average, the 100 known glycans contain 3.3 HexNAc 
monosaccharides with a standard deviation of 2.3. This process is repeated to calculate 
the mean and standard deviation for each monosaccharide component (Hex, dHex, Pent, 
etc.) and each adduct in the known glycans, if adducts are being accounted for. 
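
The calculation of reference statistics described above can be sketched as follows. The data layout (a list of mass and composition pairs), the function name and the `window` parameter are hypothetical. The sketch also assumes that a component absent from a glycan counts as zero and that the sample standard deviation is intended; the text does not state either point.

```python
from statistics import mean, stdev


def reference_statistics(reference_glycans, search_mass, window=200.0):
    """Mean and standard deviation of each component within the mass window.

    reference_glycans is a list of (mass, {component: count}) pairs; this
    layout is a hypothetical stand-in for a real reference database.
    """
    nearby = [comp for mass, comp in reference_glycans
              if abs(mass - search_mass) <= window]
    components = sorted({name for comp in nearby for name in comp})
    stats = {}
    for name in components:
        # Assumption: a component absent from a glycan counts as zero.
        counts = [comp.get(name, 0) for comp in nearby]
        # stdev() is the sample standard deviation and needs >= 2 glycans.
        stats[name] = (mean(counts), stdev(counts))
    return stats
```

With a search mass of 1000 Da and the default window, only reference glycans between 800 and 1200 Da contribute, matching the worked example in the text.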

For each candidate glycan composition, "partial scores" are then determined 
from the means and standard deviations calculated above. These are calculated for 
each monosaccharide in the given composition as the absolute value of the difference 
between the mean number of that monosaccharide in the reference set and the observed 
number of that monosaccharide in the theoretical candidate composition, divided by 
the standard deviation of that monosaccharide in compositions from the reference set. 



i.e.: 

$$\text{partialscore}_{\text{monosac}} = \frac{\left|\text{mean}_{\text{monosac}} - \text{observed}_{\text{monosac}}\right|}{\text{stddev}_{\text{monosac}}}$$



where mean_monosac is the mean number of the given monosaccharide in the 
reference data set (Glycosuite); observed_monosac is the number of the given 
monosaccharide in the theoretical candidate composition; and stddev_monosac is the 
standard deviation of the given monosaccharide in the reference data set. 

By way of example, if the theoretical glycan composition includes two HexNAc, 
three Hex and one NeuAc, the partial score for each of those three monosaccharides is 
calculated for that theoretical candidate glycan composition. Partial scores need not be 
calculated for monosaccharides which do not appear in the candidate theoretical glycan 
composition. 

In the event that mean_monosac equals observed_monosac for a particular 
glycan, the system is arranged to give the partial score a minimum value of 0.01. 
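
A minimal sketch of the partial score calculation, including the 0.01 value used when the observed count equals the mean, might look like this. The function and constant names are illustrative only, and the sketch assumes a non-zero standard deviation for every component that is scored.

```python
MIN_PARTIAL_SCORE = 0.01  # value given when the observed count equals the mean


def partial_score(mean_count, stddev_count, observed_count):
    """Number of standard deviations between observed_count and mean_count."""
    if observed_count == mean_count:
        return MIN_PARTIAL_SCORE
    # Assumes stddev_count is non-zero.
    return abs(mean_count - observed_count) / stddev_count


def partial_scores(composition, stats):
    """Partial score for each component present in a candidate composition.

    stats maps each component to a (mean, standard deviation) pair.
    Components absent from the composition are not scored.
    """
    return {name: partial_score(*stats[name], count)
            for name, count in composition.items()}
```

Using the HexNAc figures from the text (mean 3.3, standard deviation 2.3), a candidate containing two HexNAc gets `partial_score(3.3, 2.3, 2)`, which is 1.3 / 2.3, or roughly 0.57 standard deviations from the mean.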

Thus, the partial score of a monosaccharide is in fact the number of standard 
deviations away from the mean that that monosaccharide is in the 
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CLAIMS: 

1. A method of determining the likelihood of a saccharide composition of a 
candidate glycan comprising: 

providing a search mass of a glycan whose composition is to be determined; 
generating a list of possible glycans made up of components, including 
monosaccharides, whose total mass is within a predetermined tolerance of the search 
mass; 

selecting a reference group of known characterised glycans; 

establishing the mean and standard deviation of each component appearing in 
the reference group of the known characterised glycans; 

for each candidate glycan calculating a partial score for each component in that 
theoretical glycan candidate, the partial score being calculated from the mean and 
standard deviation of the component appearing in the reference group and which 
provides a measure of the likelihood of that component being present in the candidate 
glycan; and 

combining the partial scores to provide an indication of the likelihood of that 
candidate glycan occurring. 

2. A method as claimed in claim 1 wherein the reference group of glycans 
comprises glycans of approximately similar mass to the search mass. 

3. A method as claimed in claim 1 or 2 wherein the partial scores for each 
component are based on the difference between the observed number of the component 
in the candidate glycan composition and the mean for that component in the reference 
group, divided by the standard deviation, and wherein the combining of the partial 
scores is carried out by multiplying the partial scores together. 






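4. A method as claimed in claim 1 or 2 wherein the partial score for each 
component is calculated according to the equation: 

$$\text{partialscore}_{\text{monosac}} = \frac{\left|\text{mean}_{\text{monosac}} - \text{observed}_{\text{monosac}}\right|}{\text{stddev}_{\text{monosac}}}$$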

where mean_monosac is the mean number of the given monosaccharide in the 
reference data set; observed_monosac is the number of the given monosaccharide in the 
candidate glycan; and stddev_monosac is the standard deviation of the given 
monosaccharide in the reference data set. 

5. A method as claimed in claim 1 or 2 wherein the partial score for each 
component is calculated according to the equation: 



$$\text{PartialScore}_m = \frac{e^{-\frac{(\text{StDevScore}_m)^2}{2}}}{\sqrt{2\pi} \times \text{stdev}_m}$$

where StDevScore_m = Abs(count_m - mean_m)/stdev_m 

6. A method as claimed in claim 5 wherein the probability of the candidate glycan 
or "biological index" is calculated according to the equation: 
biological index = — ^= ^ — — v 

\L \jnemonosaccharides ^^^^^^^^^^K^m ) 

7. A method as claimed in any one of claims 1 to 6 wherein the predetermined 
tolerance of the search mass is within +/- 400 Da, preferably +/- 200 Da. 

8. A system for determining the likelihood of a saccharide composition of a 
candidate glycan comprising a computer means running software implementing the 
method of any one of claims 1 to 7. 

9. A method of determining the likelihood of a saccharide composition of a 
candidate glycan using a system as claimed in claim 8 when dependent on any one of 
claims 2 to 6 including: 

inputting a search mass; 

inputting a search mass tolerance; 

inputting a biological index cut off; and 

inputting a maximum value for each component in the candidate composition. 
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